Welcome back to deep learning. So today, in this last part about weakly and self-supervised learning, we want to discuss a couple of new losses that can also help us with self-supervision.
So let's see what we have on our slides. Today is part number four, and the topic is contrastive self-supervised learning.
So in contrastive learning you try to formulate the learning problem as a kind of matching approach.
So here we have an example from supervised learning and the idea is then to match the correct animal with respect to other animals.
So the learning task is whether the animal is the same or a different one.
And this is actually a very powerful way of training because you can avoid a couple of disadvantages of generative or context models.
So for example pixel level losses could overly focus on pixel based details and pixel based objectives often assume pixel independence.
And this reduces the ability to model correlations or complex structures.
So here we can essentially build abstract models that are also organized in a kind of hierarchical way.
Now we have this supervised example but of course this also works with many of the different pseudo labels that we've seen earlier.
So we can then build this contrastive loss, and all that we need is the current sample x, some positive sample x⁺, and then negative samples that are all from a different class.
In addition we need a similarity function S and this could for example be a cosine similarity.
You could also have a trainable similarity function but generally you can also use some of the standard similarity functions that we already discussed in this lecture.
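As a small reminder, the cosine similarity mentioned here is just the normalized dot product of two feature vectors:

```latex
s(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert\mathbf{u}\rVert_2 \, \lVert\mathbf{v}\rVert_2}
```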
Furthermore, you then want to apply your network f to x and compute the similarities, and the goal is to have a higher similarity between the sample under consideration and the positive sample than between the sample and any of the negative ones.
This then leads to the contrastive loss. Sometimes this is also called the InfoNCE loss, and there are also other names in the literature: the N-pair loss, the consistency loss, the ranking-based NCE, and so on.
It's essentially a cross-entropy loss for an N-way softmax classifier, and the idea here is that the numerator contains the score of the positive example, and this is then normalized by the scores of all of the examples. At first I'm still splitting off the positive term in the denominator, but you can see that the denominator is simply a sum over all of the samples, and this then yields a kind of softmax classifier.
So minimizing the contrastive loss actually maximizes a lower bound on the mutual information between f(x) and f(x⁺), as shown in the two references on the slide.
And there is also, of course, a common variation that introduces a temperature hyperparameter, as shown in this example: you divide the similarities by an additional τ.
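Written out, one common form of this temperature-scaled contrastive loss for a sample x with positive x⁺ and N − 1 negatives x⁻ is the following (the notation is chosen here for illustration and may differ slightly from the slides):

```latex
\mathcal{L} = -\log \frac{\exp\big(s(f(x), f(x^{+}))/\tau\big)}
{\exp\big(s(f(x), f(x^{+}))/\tau\big) + \sum_{j=1}^{N-1} \exp\big(s(f(x), f(x_{j}^{-}))/\tau\big)}
```

For τ = 1 this reduces to the loss discussed before; smaller values of τ sharpen the softmax and put more weight on hard negatives.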
The contrastive losses are very effective and they have two very useful properties.
On the one hand they create alignment, so similar samples have similar features, and on the other hand they create uniformity, because they preserve the maximal information.
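Wang and Isola [35] make these two properties explicit as separate objectives; with their default parameters and L2-normalized features f(x), a sketch of the two terms reads:

```latex
\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, x^{+}) \sim p_{\text{pos}}}\big[\lVert f(x) - f(x^{+})\rVert_2^{2}\big],
\qquad
\mathcal{L}_{\text{uniform}} = \log \, \mathbb{E}_{x, y \sim p_{\text{data}}}\big[e^{-2\lVert f(x) - f(y)\rVert_2^{2}}\big]
```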
Let's look into an example of how this can be constructed, and this is reference [31]. So let's say you start with a mini-batch of N samples. Then two different data augmentation operations are applied to each sample. This then leads to 2N augmented samples.
So for every sample in the batch you get two different augmentations.
Here we take t and t′ and apply them to the same sample, so you know that these two augmented versions stem from the same sample. And your transformations t and t′ are taken from a set of augmentations T. Now you end up with one positive pair for each sample and 2(N − 1) negative examples, because all of the other samples are different.
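This batch construction can be sketched in a few lines of PyTorch; the function name make_views and the augment argument are illustrative placeholders, not the reference implementation of [31]:

```python
import torch

def make_views(batch: torch.Tensor, augment) -> torch.Tensor:
    """batch: (N, C, H, W) images -> (2N, C, H, W) augmented views.

    Rows 2k and 2k+1 are the two views of image k, i.e. the positive pair."""
    views = []
    for x in batch:
        # each call to augment samples a new random transform, playing the role of t and t'
        views.append(augment(x))
        views.append(augment(x))
    return torch.stack(views)
```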
And then you compute a representation through the base encoder f that then produces some h, which is the actual feature representation that we're interested in.
And an example for F could be like a ResNet-50.
Then on top of this we have a representation head G that then does an additional dimensionality reduction.
And note that both g and f are identical, i.e. weight-shared, in both branches.
So you could argue this approach has considerable similarities to what is called a Siamese network.
So you then obtain two different z's, z_i and z_j, from g. And with those z's you can compute your contrastive loss, which is then expressed in terms of the similarity between the two z's divided by the temperature parameter τ. And you see this is essentially the same contrastive loss that we have already seen earlier.
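As a minimal sketch, the loss on the projections z can be written in PyTorch like this, assuming the 2N rows are arranged so that rows 2k and 2k+1 form the positive pairs (as produced by the batch construction sketched earlier); names and the default temperature are illustrative, not the reference code of [31]:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z: (2N, d) projection-head outputs; returns the scalar contrastive loss."""
    z = F.normalize(z, dim=1)              # unit vectors, so dot products are cosine similarities
    sim = z @ z.t() / tau                  # (2N, 2N) temperature-scaled similarity matrix
    sim.fill_diagonal_(float('-inf'))      # a sample must never be its own negative
    # the positive partner of row i is i XOR 1: (0,1), (2,3), ... are swapped pairs
    pos_index = torch.arange(z.shape[0], device=z.device) ^ 1
    # cross entropy over the remaining 2N - 1 samples with the positive as the target
    return F.cross_entropy(sim, pos_index)
```

In practice you would stack the two augmented views of a mini-batch, pass them through f and g, and call nt_xent_loss on the resulting z.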
Contrastive losses of course can also be combined in a supervised way and this then leads to supervised contrastive learning.
And here the idea is that if you just perform self-supervised contrastive learning, you have a single positive example versus the negative ones.
And with the supervised contrastive loss you can then also embed additional class information.
So this has additional positive effects.
So let's see how this works and summarize a bit what the difference is between supervised learning, contrastive learning, and supervised contrastive learning.
Well, in supervised learning you would essentially have your encoder, which is shown here as this trapezoid. Then you end up with some representation, a 2048-dimensional vector, and you then train a classification head that produces the class "dog" using the classical cross-entropy loss.
Now in contrastive learning you would then expand on this. So you would essentially have two coupled networks, and you have the two patches that either are the same or not the same, each with a different augmentation technique. And you train these coupled, weight-shared networks to produce 2048-dimensional representations; on top you have this representation head, and you use the contrastive loss on the output of this head, not directly on the representation layer itself.
Now if you combine the two into supervised contrastive learning, you essentially have the same setup as for the contrastive loss, but at the representation layer you can then add an additional loss that is strictly supervised. There you still have the typical softmax that goes to, let's say, a thousand classes and predicts "dog" in this example, and you couple it at the representation layer to be able to fuse the contrastive and supervised losses. So the self-supervised loss has no knowledge of the class labels and only knows about one positive example; the supervised contrastive loss has knowledge of the class labels and has many positives per example.
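For reference, one common way to write the supervised contrastive loss of [33] is to sum over all positives P(i) that share the label of the anchor i, with A(i) denoting all other samples in the batch (again a sketch, the paper discusses several variants):

```latex
\mathcal{L}^{\text{sup}} = \sum_{i} \frac{-1}{\lvert P(i) \rvert} \sum_{p \in P(i)} \log \frac{\exp\big(z_{i} \cdot z_{p} / \tau\big)}{\sum_{a \in A(i)} \exp\big(z_{i} \cdot z_{a} / \tau\big)}
```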
Deep Learning - Weakly and Self-Supervised Learning Part 4
In this video, we look into contrastive losses and how they can be used in combination with self-supervised learning.
Further Reading:
A gentle Introduction to Deep Learning
References
[1] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, et al. “3d u-net: learning dense volumetric segmentation from sparse annotation”. In: MICCAI. Springer. 2016, pp. 424–432.
[2] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. Accessed: 27.01.2020. 2017.
[3] Olga Russakovsky, Amy L. Bearman, Vittorio Ferrari, et al. “What’s the point: Semantic segmentation with point supervision”. In: CoRR abs/1506.02106 (2015). arXiv: 1506.02106.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, et al. “The Cityscapes Dataset for Semantic Urban Scene Understanding”. In: CoRR abs/1604.01685 (2016). arXiv: 1604.01685.
[5] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern classification. 2nd ed. New York: Wiley-Interscience, Nov. 2000.
[6] Anna Khoreva, Rodrigo Benenson, Jan Hosang, et al. “Simple Does It: Weakly Supervised Instance and Semantic Segmentation”. In: arXiv preprint arXiv:1603.07485 (2016).
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, et al. “Mask R-CNN”. In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870.
[8] Sangheum Hwang and Hyo-Eun Kim. “Self-Transfer Learning for Weakly Supervised Lesion Localization”. In: MICCAI. Springer. 2016, pp. 239–246.
[9] Maxime Oquab, Léon Bottou, Ivan Laptev, et al. “Is object localization for free? weakly-supervised learning with convolutional neural networks”. In: Proc. CVPR. 2015, pp. 685–694.
[10] Alexander Kolesnikov and Christoph H. Lampert. “Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation”. In: CoRR abs/1603.06098 (2016). arXiv: 1603.06098.
[11] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, et al. “Microsoft COCO: Common Objects in Context”. In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312.
[12] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, et al. “Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization”. In: CoRR abs/1610.02391 (2016). arXiv: 1610.02391.
[13] K. Simonyan, A. Vedaldi, and A. Zisserman. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”. In: Proc. ICLR (workshop track). 2014.
[14] Bolei Zhou, Aditya Khosla, Agata Lapedriza, et al. “Learning deep features for discriminative localization”. In: Proc. CVPR. 2016, pp. 2921–2929.
[15] Longlong Jing and Yingli Tian. “Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey”. In: arXiv e-prints, arXiv:1902.06162 (Feb. 2019). arXiv: 1902.06162 [cs.CV].
[16] D. Pathak, P. Krähenbühl, J. Donahue, et al. “Context Encoders: Feature Learning by Inpainting”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2536–2544.
[17] C. Doersch, A. Gupta, and A. A. Efros. “Unsupervised Visual Representation Learning by Context Prediction”. In: 2015 IEEE International Conference on Computer Vision (ICCV). Dec. 2015, pp. 1422–1430.
[18] Mehdi Noroozi and Paolo Favaro. “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”. In: Computer Vision – ECCV 2016. Cham: Springer International Publishing, 2016, pp. 69–84.
[19] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. “Unsupervised Representation Learning by Predicting Image Rotations”. In: International Conference on Learning Representations. 2018.
[20] Mathilde Caron, Piotr Bojanowski, Armand Joulin, et al. “Deep Clustering for Unsupervised Learning of Visual Features”. In: Computer Vision – ECCV 2018. Cham: Springer International Publishing, 2018, pp. 139–156.
[21] A. Dosovitskiy, P. Fischer, J. T. Springenberg, et al. “Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.9 (Sept. 2016), pp. 1734–1747.
[22] V. Christlein, M. Gropp, S. Fiel, et al. “Unsupervised Feature Learning for Writer Identification and Writer Retrieval”. In: 2017 14th IAPR International Conference on Document Analysis and Recognition Vol. 01. Nov. 2017, pp. 991–997.
[23] Z. Ren and Y. J. Lee. “Cross-Domain Self-Supervised Multi-task Feature Learning Using Synthetic Imagery”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. June 2018, pp. 762–771.
[24] Asano YM., Rupprecht C., and Vedaldi A. “Self-labelling via simultaneous clustering and representation learning”. In: International Conference on Learning Representations. 2020.
[25] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, et al. “On Variational Bounds of Mutual Information”. In: Proceedings of the 36th International Conference on Machine Learning. Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR, Sept. 2019, pp. 5171–5180.
[26] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, et al. “Learning deep representations by mutual information estimation and maximization”. In: International Conference on Learning Representations. 2019.
[27] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding”. In: arXiv e-prints, arXiv:1807.03748 (July 2018). arXiv: 1807.03748 [cs.LG].
[28] Philip Bachman, R Devon Hjelm, and William Buchwalter. “Learning Representations by Maximizing Mutual Information Across Views”. In: Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 15535–15545.
[29] Yonglong Tian, Dilip Krishnan, and Phillip Isola. “Contrastive Multiview Coding”. In: arXiv e-prints, arXiv:1906.05849 (June 2019), arXiv:1906.05849. arXiv: 1906.05849 [cs.CV].
[30] Kaiming He, Haoqi Fan, Yuxin Wu, et al. “Momentum Contrast for Unsupervised Visual Representation Learning”. In: arXiv e-prints, arXiv:1911.05722 (Nov. 2019). arXiv: 1911.05722 [cs.CV].
[31] Ting Chen, Simon Kornblith, Mohammad Norouzi, et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: arXiv e-prints, arXiv:2002.05709 (Feb. 2020), arXiv:2002.05709. arXiv: 2002.05709 [cs.LG].
[32] Ishan Misra and Laurens van der Maaten. “Self-Supervised Learning of Pretext-Invariant Representations”. In: arXiv e-prints, arXiv:1912.01991 (Dec. 2019). arXiv: 1912.01991 [cs.CV].
[33] Prannay Khosla, Piotr Teterwak, Chen Wang, et al. “Supervised Contrastive Learning”. In: arXiv e-prints, arXiv:2004.11362 (Apr. 2020). arXiv: 2004.11362 [cs.LG].
[34] Jean-Bastien Grill, Florian Strub, Florent Altché, et al. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning”. In: arXiv e-prints, arXiv:2006.07733 (June 2020), arXiv:2006.07733. arXiv: 2006.07733 [cs.LG].
[35] Tongzhou Wang and Phillip Isola. “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”. In: arXiv e-prints, arXiv:2005.10242 (May 2020), arXiv:2005.10242. arXiv: 2005.10242 [cs.LG].
[36] Junnan Li, Pan Zhou, Caiming Xiong, et al. “Prototypical Contrastive Learning of Unsupervised Representations”. In: arXiv e-prints, arXiv:2005.04966 (May 2020), arXiv:2005.04966. arXiv: 2005.04966 [cs.CV].